Az implementation #4061
base: availability-zones
Conversation
Codecov Report

```diff
@@           Coverage Diff            @@
##  availability-zones   #4061  +/-  ##
=======================================
- Coverage      89.24%  89.14%  -0.10%
=======================================
  Files            246     246
  Lines          15756   15796     +40
=======================================
+ Hits           14061   14081     +20
- Misses          1695    1715     +20
```
src/daemon/daemon.cpp (outdated)

```cpp
if (vm->current_state() == VirtualMachine::State::unavailable)
{
    mpl::log(mpl::Level::info, name, "Ignoring mount since instance unavailable.");
```
Hmm... I think we need a better way to represent "unavailability". My main concern is that this approach is error-prone and repetitive. The problem is that "unavailable" instances are still a part of the operative instances, which is wrong. An unavailable instance is not operative since it can't be put into an operative state unless it's made available, so an unavailable instance shouldn't be a part of the "operative" set of VMs in the first place.
Right now, we have two sets of VMs: "operative" and "deleted". I think what we need here is a third category, "unavailable". This way, we can just update the error messages to say "does not exist or unavailable" and call it a day, until we overhaul the VM management logic. @ricab, what do you think?
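A rough sketch of that three-bucket split, with all names invented for illustration (not the actual daemon members):

```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct VirtualMachine; // stand-in for the real class

using InstanceMap = std::unordered_map<std::string, std::shared_ptr<VirtualMachine>>;

// One map per lifecycle bucket; an instance lives in exactly one of them, so
// unavailable instances never show up among the operative ones in the first place.
InstanceMap operative_instances;
InstanceMap deleted_instances;
InstanceMap unavailable_instances; // the proposed third category
```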
Hmm good point. I haven't thought a lot about it, but I think I agree with you @xmkg. We also need to define what happens at the intersection of deleted/unavailable, i.e. deleted instances that belong to an unavailable AZ. I guess we just declare that "deleted" takes precedence over "unavailable"? We need to answer a few questions:
- does deleting an instance unsubscribe it from the AZ?
- does enabling/disabling an AZ affect deleted instances? (I hope not)
- what happens on recover of an instance whose AZ is unavailable?
Hi @ricab, good questions. Regarding deleted + unavailable at the same time, I think it's reasonable to make deleted take precedence over unavailable. Another avenue might be to consolidate all instances into a single container and represent the "availability" state in a bit flag. This way, an instance can be deleted & unavailable at the same time.
Consolidating all instances into a single container carries a cost: lookup and iteration speed, which can be troublesome for users with many instances. That aside, std::ranges filtering would work just fine. If iteration speed matters, another option might be a multi-index container, which allows several indices over one set of values instead of just one.
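A minimal sketch of the single-container idea with bit flags and std::ranges filtering, assuming C++20 and invented names:

```cpp
#include <memory>
#include <ranges>
#include <string>
#include <unordered_map>

// Stand-in for the real class; availability becomes plain state on the VM.
struct VirtualMachine
{
    std::string name;
    bool deleted = false;
    bool available = true;
};

using InstanceMap = std::unordered_map<std::string, std::shared_ptr<VirtualMachine>>;

int main()
{
    InstanceMap instances;

    // The "operative" view: neither deleted nor unavailable. No second
    // container to keep in sync, and deleted & unavailable can coexist.
    auto operative = instances | std::views::filter([](const auto& entry) {
                         return !entry.second->deleted && entry.second->available;
                     });

    for (const auto& [name, vm] : operative)
    {
        // dispatch commands to vm here
    }
}
```

The multi-index option would be something like Boost.Multi-index, trading this simplicity for ready-made secondary indices.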
Regarding the last three questions, those should be answered in the AZ spec itself, if not already. @Sploder12 😉
@xmkg I fully agree we need a better way to represent "unavailability"; it is known that my current approach is terrible. But my issue with using operative instances and a new unavailable table is that we'd be moving even more responsibility into the daemon. Also, operative_instances is the most unsafe thing in our entire code base at the moment: it guarantees race conditions.
> does deleting an instance unsubscribe it from the AZ?

Based on spec and implementation, no. Although it'd be nice if this were possible.

> does enabling/disabling an AZ affect deleted instances? (I hope not)

Yes, it is necessary for recover in the current implementation :)

> what happens on recover of an instance whose AZ is unavailable?

This is unaccounted for by the spec. The implementation recovers the instance in an unavailable state.
@Sploder12 understood. I reviewed the patch in my local environment as well, and it seems there is no single place where we can put the "unavailable" check. The commands are inconsistent in how they retrieve instances: some use operative_instances directly, whereas others go through the select_instances function, and so on. We desperately need to refactor this, because it could be implemented in a much simpler and safer way.
@ricab, shall we proceed as-is and refactor this later on?
I've been thinking quite a bit about this and wrote a bunch (sorry). We need to discuss, but I think the effect of what follows will be small for this PR. But better have our ideas in order.
Deleted instances vs AZs
I see two possible approaches:
1. Instances remain subscribed on deletion:
   - deleted instances' state is updated on AZ status change; the following changes are possible on a deleted instance:
     - arbitrary state -> unavailable (available -> unavailable is a particular case)
     - unavailable -> available
   - recovering a deleted instance does not touch the state
2. Instances unsubscribe on deletion:
   - states don't change while an instance is deleted
   - instances would need to keep track of their AZ in a different way
   - recovering a deleted instance does two new things:
     - resubscribe it to its AZ
     - update its state according to AZ state
If this were being implemented from scratch, I would probably prefer 2 and drop the AZ manager entity entirely, as discussed with Andrei in the early AZ days. I still don't see the benefit, but it's not the time to U-turn now.
So, that leaves option 1. IIUC, that corresponds to what is implemented, correct @Sploder12?
Which container
operative_instances is just a name. The idea was to contrast it with deleted_instances and indicate that those are disjoint collections. Previously it was called instances (IIRC), which sounded like a superset. If necessary we could rename again (e.g. retained_instances).
I would love to have a single collection, as @xmkg suggested, with ranges and "deleted" as just another state in the FSM (same representation). It would simplify a lot of code.
> Consolidating all instances into a single container carries a cost: lookup and iteration speed.
At the sizes we're talking about, I am pretty sure it would be OK, if we did it right. If the deleted state were directly available in a plain field of the VM class, I would be willing to bet that a user with 1000 deleted instances and 1 other instance would notice no performance degradation with one vs two hash tables. 1000 iterations of "if X continue" are really fast, especially if X is already cached. It might even work the other way around, due to cache locality. And I'd bet very few people ever reached 100 instances anyway.
We would have to do it right though, in the sense that we would need to make sure we shortcut the state update when it was deleted, preferably minimizing virtual calls in that case (another case for template method). 1000 iterations of going to the backend to update the state would be very noticeable.
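To make that concrete, a sketch of the update pass under those assumptions (illustrative names, not the actual Multipass code):

```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct VirtualMachine
{
    bool deleted = false;          // plain field: checking it never leaves the process
    virtual void update_state() {} // virtual call that may round-trip to the backend
    virtual ~VirtualMachine() = default;
};

using InstanceMap = std::unordered_map<std::string, std::shared_ptr<VirtualMachine>>;

// Shortcut deleted instances before any virtual/backend work: 1000 deleted
// entries cost 1000 cached flag checks, not 1000 hypervisor queries.
void refresh_states(const InstanceMap& instances)
{
    for (const auto& [name, vm] : instances)
    {
        if (vm->deleted)
            continue;
        vm->update_state();
    }
}
```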
I suspect that perf did not play much of a role in the decision to have 2 collections in the first place. My guess is that it was just the immediate representation of a trash can in code. Idk...
All that said, this would be a very significant refactor. I wonder if AZs are the right place for it... Especially given that we don't have ranges yet.
> @ricab, shall we proceed as-is and refactor this later on?
Meh, probably... :/
src/daemon/daemon.cpp (outdated)

```cpp
if (vm->current_state() == VirtualMachine::State::unavailable)
{
    return fmt::to_string(errors);
```
Maybe push a message into `errors` here, stating that the VM is unavailable?
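A hedged sketch of that suggestion, assuming `errors` is a `fmt::memory_buffer` (as the surrounding `fmt::to_string(errors)` implies):

```cpp
// Fragment only; needs <fmt/format.h> and <iterator> in the including file.
if (vm->current_state() == VirtualMachine::State::unavailable)
{
    fmt::format_to(std::back_inserter(errors),
                   "instance \"{}\" is unavailable\n",
                   name);
    return fmt::to_string(errors);
}
```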
Thanks @Sploder12. Overall this looks close to what we need, but I have a few comments for discussion.
Almost there!
```cpp
if (vm.current_state() == VirtualMachine::State::unavailable)
{
    mpl::log(mpl::Level::info,
             vm.vm_name,
             "Ignoring suspend since instance is unavailable.");
    return grpc::Status::OK;
}
```
Idem: shouldn't this be handled by the VM? It would go in the direction of relieving the daemon from such concerns.
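For illustration, the template-method shape this could take, with invented names rather than the actual Multipass classes:

```cpp
class VirtualMachine
{
public:
    enum class State { off, running, suspending, suspended, unavailable };

    // Non-virtual entry point owns the guard once, for every backend...
    void suspend()
    {
        if (state == State::unavailable)
            return; // log-and-ignore, as the daemon does today

        do_suspend();
    }

    virtual ~VirtualMachine() = default;

protected:
    virtual void do_suspend() = 0; // ...backends implement only the real work

    State state = State::off;
};
```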
```cpp
m.available = true;
serialize();
```
Couldn't we just do this at the end?
Since the catch is rethrowing, control flow would never reach the end of the function.
Right. You could use a scope guard, but no need really.
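For reference, the kind of scope guard meant here, as a minimal sketch (the callback must itself be non-throwing, since it may run during unwinding):

```cpp
// Runs the callback on scope exit, whether the scope is left normally or via
// the rethrown exception.
template <typename F>
struct ScopeGuard
{
    F on_exit;
    ~ScopeGuard() { on_exit(); }
};

template <typename F>
ScopeGuard(F) -> ScopeGuard<F>;

// Hypothetical usage around the code above:
//   ScopeGuard guard{[&] { m.available = true; serialize(); }};
//   ... work that may throw; the catch rethrows, but the guard still fires ...
```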
```cpp
    return;
}

was_running = state == State::running || state == State::starting;
```
What if the state was restarting?
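Assuming the State enum has a `restarting` value that should count as running too, the check would presumably just grow a term:

```cpp
was_running = state == State::running || state == State::starting ||
              state == State::restarting;
```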
I still have a few suggestions, but I will leave them to your consideration and pre-approve.
```diff
-                 name);
+                 "Cannot start the instance '{}' while {}.",
+                 name,
+                 vm.current_state() == VirtualMachine::State::suspending ? "suspending"
```
This will be easier once the Hyper-V branch is in: https://github.com/canonical/multipass/pull/4080/files#diff-d847b18fbc46d755306da3eb6cac52de71043c1cc87ff345cdf8b034ad141ffaR158
You could add a TODO here, or convince @xmkg to split that off 😉
```cpp
if (state == State::unavailable)
{
    throw VMStateIdempotentException{"Ignoring shutdown since instance is unavailable."};
}
```
How about merging with the above branch here as well? Again, the formatter would help. If you want to simplify, I wouldn't be shocked if the message read "Ignoring shutdown since instance is already unavailable".
```cpp
}
else if (present_state == State::unavailable)
{
    mpl::log(mpl::Level::info, vm_name, "Ignoring suspend since instance is unavailable");
```
idem: merge with the above?
Otherwise, use mpl::info?
```diff
-    mpl::info(vm_name, "Ignoring suspend issued while stopped/suspended");
+    mpl::log(mpl::Level::info,
+             vm_name,
+             "Ignoring suspend issued while stopped/suspended/unavailable");
```
idem: state formatting would help
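A sketch of what that formatter could look like: a `fmt::formatter` specialization for the existing `VirtualMachine::State` enum (fragment; the case list is abbreviated and the enum is assumed visible):

```cpp
#include <fmt/format.h>

template <>
struct fmt::formatter<VirtualMachine::State> : fmt::formatter<fmt::string_view>
{
    auto format(VirtualMachine::State state, fmt::format_context& ctx) const
    {
        fmt::string_view name = "unknown";
        switch (state)
        {
        case VirtualMachine::State::stopped:     name = "stopped"; break;
        case VirtualMachine::State::suspended:   name = "suspended"; break;
        case VirtualMachine::State::unavailable: name = "unavailable"; break;
        default: break;
        }
        return fmt::formatter<fmt::string_view>::format(name, ctx);
    }
};
```

Log messages could then interpolate the state directly, e.g. `fmt::format("Ignoring suspend issued while {}", present_state)`, instead of hand-written "stopped/suspended/unavailable" strings.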
```cpp
// setting the state here breaks encapsulation, but it's already broken.
std::unique_lock vm_lock{vm.get().state_mutex};
if (vm.get().current_state() == VirtualMachine::State::unavailable)
{
    vm.get().state = VirtualMachine::State::off;
    vm.get().update_state();
}
```
Have you considered force-stop instead? That would avoid breaking encapsulation yet again and make sure there were no leftover processes.
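A hypothetical sketch of that alternative; the shutdown call and its force flag are assumed for illustration, not taken from the actual Multipass API:

```cpp
if (vm.get().current_state() == VirtualMachine::State::unavailable)
{
    // Assumed call, for illustration: a forced stop would reap leftover
    // processes and let the VM update its own state, with no outside writes.
    vm.get().shutdown(/*force=*/true);
}
```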
This PR adds implementations for making VMs unavailable and available. It also has unavailable VMs ignore commands that would change anything related to the VM. There are a few known race conditions, but fixing them is outside the scope of AZs.
MULTI-1789
MULTI-1941